Medical fraud is a major contributor to inflated expenses and waste in the US healthcare system. Forbes identifies the industry as among the most vulnerable to fraud and cyber-attacks. This investigation explores physician-level Medicare and Medicaid data through exploratory data analysis and a classification model, aiming to show how big data can help prevent this problem.
The inspiration for this project came from my colleague at IBM, whose mother was a victim of medical ID theft. Although the problem of ID theft is slightly different (it would require transactional data), CMS and LEIE data are among the best publicly available sources for healthcare and thus were used as a proof of concept. The project reaffirms that, as with credit card fraud detection, fraudulent practices in healthcare can be detected and mitigated. The most vulnerable demographic, dependent elders 65 years and older, would benefit the most from this technology.
Healthcare fraud in the news:
Data sources:
In cleaning and combining the two datasets, I closely followed previous literature; both papers are listed in the "References" section.
import os
import pandas as pd
# Set up data directory
CWD = os.getcwd()
cms_data_dir = os.path.join(CWD, 'CMSData')
# Some years' columns are capitalized while other years' are lowercase:
capitalization_dict = {
'2012': str.upper,
'2013': str.upper,
'2014': str.lower,
'2015': str.lower,
'2016': str.upper,
'2017': str.lower,
}
# Set dtypes
partB_dtypes = {
'npi': 'str',
'nppes_provider_last_org_name': 'str',
'nppes_provider_first_name': 'str',
'nppes_provider_mi': 'str',
'nppes_credentials': 'str',
'nppes_provider_gender': 'str',
'nppes_entity_code': 'str',
'nppes_provider_street1': 'str',
'nppes_provider_street2': 'str',
'nppes_provider_city': 'str',
'nppes_provider_zip': 'str',
'nppes_provider_state': 'str',
'nppes_provider_country': 'str',
'provider_type': 'str',
'medicare_participation_indicator': 'str',
'place_of_service': 'str',
'hcpcs_code': 'str',
'hcpcs_description': 'str',
'hcpcs_drug_indicator': 'str',
'line_srvc_cnt': 'float64',
'bene_unique_cnt': 'float64',
'bene_day_srvc_cnt': 'float64',
'average_medicare_allowed_amt': 'float64',
'average_submitted_chrg_amt': 'float64',
'average_medicare_payment_amt': 'float64',
'average_medicare_standard_amt': 'float64',
}
# Get dfs for all years - takes a few minutes to run
years = ['2012','2013','2015','2016']
dfs = []
for year in years:
    file = os.path.join(cms_data_dir, f'cms{year}.txt')
    # Get the correct column capitalization for this year, keeping the dtypes
    dtypes = {capitalization_dict[year](col): dtype for col, dtype in partB_dtypes.items()}
    df = pd.read_csv(file, delimiter='\t', dtype=dtypes)
    df.columns = map(str.lower, df.columns)  # make all column names lowercase
    df['year'] = year  # add year column
    dfs.append(df)
# Concatenate
partB_df = pd.concat(dfs, axis=0, ignore_index=True, sort=False)
partB_df.shape
# Remove rows corresponding to drugs because LINE_SRVC_CNT for them is not a desirable count
partB_df = partB_df[(partB_df['hcpcs_drug_indicator'] == 'N')]
partB_df.shape
# Drop missing NPI and HCPCS
# Note: 2014 and 2017 were left out of `years` above because those files lacked the HCPCS code column
partB_df = partB_df.dropna(subset = ['npi','hcpcs_code'])
partB_df.shape
# Keep variables based on "Medicare fraud detection using neural networks" (Johnson, Khoshgoftaar 2019)
partB_variables_to_keep = [
'npi',
'provider_type',
'nppes_provider_city', # keep
'nppes_provider_zip', # keep
'nppes_provider_state', # keep
'nppes_provider_country', # keep
'hcpcs_code', # not in paper but kept
'hcpcs_description', # not in paper but kept
'hcpcs_drug_indicator', # not in paper but kept
'place_of_service', # not in paper but kept
'nppes_provider_gender',
'line_srvc_cnt',
'bene_unique_cnt',
'bene_day_srvc_cnt',
'average_submitted_chrg_amt',
'average_medicare_payment_amt',
'year' # need Year for labeling
]
partB_df = partB_df[partB_variables_to_keep]
partB_df.head()
partB_df.loc[partB_df['npi'] == '1003000142'][['npi',
'provider_type',
'place_of_service',
'line_srvc_cnt',
'average_submitted_chrg_amt',
'year']][:5]
partB_df['year'].value_counts()
# Write all combined CMS to csv
#partB_df.to_csv('combined-partB-data-v2')
leie_data_dir = os.path.join(CWD, 'LEIEData')
leie_dtypes = {
'LASTNAME': 'str',
'FIRSTNAME': 'str',
'MIDNAME': 'str',
'BUSNAME' : 'str',
'GENERAL': 'str',
'SPECIALTY': 'str',
'UPIN': 'str',
'NPI': 'str',
'DOB': 'str',
'ADDRESS': 'str',
'CITY': 'str',
'STATE': 'str',
'ZIP': 'str',
'EXCLTYPE': 'str',
'EXCLDATE': 'int64',
'REINDATE': 'int64',
'WAIVERDATE': 'int64',
'WVRSTATE': 'str',
}
#LEIE data is monthly between 01/2018 (1801) - 12/2019 (1912)
year_months = ['1801','1802','1803','1804','1805','1806','1807','1808','1809','1810','1811','1812',
'1901','1902','1903','1904','1905','1906','1907','1908','1909','1910','1911','1912']
dfs = []
for year_month in year_months:
    file = os.path.join(leie_data_dir, f'leie{year_month}-excl.csv')
    df = pd.read_csv(file, dtype=leie_dtypes)
    df.columns = map(str.lower, df.columns)
    dfs.append(df)
# Concatenate
leie_df = pd.concat(dfs, axis=0, ignore_index=True, sort=False)
leie_df.shape
leie_df.head()
# Drop NPI = '0', which means missing - a lot are missing, which is a limitation of the data
leie_df = leie_df[leie_df['npi'] != '0']  # npi was read as str, so compare against '0'
leie_df.shape
# Keep exclusions most related to Fraud
exclusions_to_keep = [
'1128a1',
'1128a2',
'1128a3',
'1128b4',
'1128b7',
'1128c3Gi',
'1128c3gii',
]
leie_df = leie_df[leie_df['excltype'].isin(exclusions_to_keep)]
leie_df.shape
leie_df['excltype'].value_counts()
Type definitions of physician fraud:
# Write all combined LEIE to csv
#leie_df.to_csv('combined-leie-data')
from datetime import datetime, timedelta
import numpy as np
# Convert to datetime
leie_df['excldate'] = pd.to_datetime(leie_df['excldate'], format='%Y%m%d', errors ='ignore')
# Round exclusion date to the nearest year, per Johnson & Khoshgoftaar (2019)
def round_to_year(dt):
    '''Round a date to Jan 1 of the nearest year (June or later rolls forward)'''
    year = dt.year
    if dt.month >= 6:
        year += 1
    return datetime(year=year, month=1, day=1)

leie_df['excl_year'] = leie_df['excldate'].apply(round_to_year)
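A quick sanity check of the rounding rule (the function is restated here so the cell runs standalone): dates in June or later roll forward to January 1 of the next year, while earlier dates roll back to January 1 of the same year.

```python
from datetime import datetime

def round_to_year(dt):
    # Same rounding logic as above, restated for a self-contained check
    year = dt.year + 1 if dt.month >= 6 else dt.year
    return datetime(year=year, month=1, day=1)

print(round_to_year(datetime(2015, 6, 15)))  # 2016-01-01 00:00:00
print(round_to_year(datetime(2015, 5, 31)))  # 2015-01-01 00:00:00
```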
# Make an NPI -> exclusion-year dict
# NPI 1215053665 has 2 exclusions, so sort by year first so the latest one wins
sorted_leie = leie_df.sort_values(by='excl_year')
excl_year_dict = dict(zip(sorted_leie.npi, sorted_leie.excl_year))
# Get label as 0 or 1
partB_df['excl_year'] = partB_df['npi'].map(excl_year_dict)
partB_df['excl_year'] = partB_df['excl_year'].fillna(datetime(year=1900,month=1,day=1)) # fill NaN, physicians without exclusion, with year 1900
partB_df['year'] = pd.to_datetime(partB_df['year'].astype(str), format='%Y', errors ='ignore')
partB_df['fraudulent'] = np.where(partB_df['year'] < partB_df['excl_year'], 1, 0) # compare year vs. exclusion year to get Fraudulent
print("partB_df is our combined dataset with shape: {0}".format(partB_df.shape))
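The labeling rule above can be illustrated on a tiny synthetic frame (the NPIs below are made up): a claim row is flagged fraudulent exactly when its service year precedes the physician's exclusion year.

```python
# Toy illustration of the labeling rule used above.
import pandas as pd
import numpy as np
from datetime import datetime

toy = pd.DataFrame({
    'npi': ['111', '111', '222'],  # hypothetical NPIs
    'year': [datetime(2013, 1, 1), datetime(2016, 1, 1), datetime(2013, 1, 1)],
})
toy_excl = {'111': datetime(2015, 1, 1)}  # '111' excluded as of 2015

toy['excl_year'] = toy['npi'].map(toy_excl).fillna(datetime(1900, 1, 1))
toy['fraudulent'] = np.where(toy['year'] < toy['excl_year'], 1, 0)
print(toy['fraudulent'].tolist())  # [1, 0, 0]
```

Only the 2013 claim by the excluded physician is labeled fraudulent; the 2016 claim postdates the exclusion, and the never-excluded physician compares against the 1900 sentinel.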
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Get number and amount of fraudulent services
partB_df['fraud_line_srvc_cnt'] = partB_df['line_srvc_cnt']*partB_df['fraudulent']
partB_df['fraud_average_submitted_chrg_amt'] = partB_df['average_submitted_chrg_amt']*partB_df['fraudulent']
partB_df['fraud_average_medicare_payment_amt'] = partB_df['average_medicare_payment_amt']*partB_df['fraudulent']
# Aggregate by state
state_df = partB_df.groupby('nppes_provider_state').agg({
'line_srvc_cnt':[('total_services_count','sum')],
'fraud_line_srvc_cnt':[('total_fraud_services_count','sum')]
}).reset_index()
# Drop multi-index
state_df.columns = ['_'.join(col) for col in state_df.columns]
state_df.columns = ['provider_state', 'total_services_count', 'fraud_services_count']
# Get % fraud
state_df['fraud_services_pct'] = state_df['fraud_services_count']/state_df['total_services_count']
state_df.head()
np.seterr(divide = 'ignore')
fig = go.Figure(data=go.Choropleth(
locations=state_df['provider_state'],
z = np.log(state_df['fraud_services_pct'].astype(float) + 1e-6), # log scale; epsilon inside the log avoids log(0)
locationmode = 'USA-states',
colorscale = 'Reds',
colorbar_title = "Logged %",
marker_line_color='white'
))
fig.update_layout(
title_text = 'Logged Percentage of Medicare Fraudulent Service by US State (2012-2016)',
geo_scope='usa',
)
fig.show()
The heat map shows a high concentration of fraudulent Medicare claims in the Northeast and South of the US, with Texas leading all states by a wide margin.
# Get dummy variables for gender
partB_df2 = pd.concat([partB_df, pd.get_dummies(partB_df['nppes_provider_gender'])], axis=1)
# Aggregate by provider type
type_df = partB_df2.groupby('provider_type').agg({
'line_srvc_cnt':['sum'],
'fraud_line_srvc_cnt':['sum'],
'M':['sum'],
'F':['sum'],
'average_submitted_chrg_amt':['median'], #since distribution skewed right
'average_medicare_payment_amt':['median'],
'fraud_average_submitted_chrg_amt':['max'],
'fraud_average_medicare_payment_amt':['max']
}).reset_index()
# Drop multi-index
type_df.columns = ['_'.join(col) for col in type_df.columns]
type_df.columns = ['provider_type', 'total_services_count', 'fraud_services_count','male_count','female_count',
'avg_submitted_chrg_amt','avg_medicare_payment_amt', 'fraud_avg_submitted_chrg_amt','fraud_avg_medicare_payment_amt']
# Sorting
type_df = type_df.sort_values('fraud_services_count',ascending=False)[:15] #get top 15 fraudulent types
type_df = type_df.sort_values('total_services_count',ascending=True).reset_index(drop=True) #re-sort by total services
# Add some fields
type_df['non_fraud_services_count'] = type_df['total_services_count'] - type_df['fraud_services_count']
type_df['fraud_services_pct'] = (type_df['fraud_services_count']/type_df['total_services_count'])*100
type_df.head()
# Get 2015 data only for speed
partB_2015_df = partB_df2[partB_df2['year'] == datetime(year=2015,month=1,day=1)]
top_7_types = type_df['provider_type'][:7].tolist()
for p_type in top_7_types:
    # Eliminate 0 payments, then log
    x = np.log(partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.average_submitted_chrg_amt != 0)]['average_submitted_chrg_amt'])
    fraud_x = np.log(partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.fraud_average_submitted_chrg_amt != 0)]['fraud_average_submitted_chrg_amt'])
    fig, ax = plt.subplots(figsize=(12, 5))
    sns.distplot(x, label='Non-fraudulent', hist=False, rug=False)
    sns.distplot(fraud_x, label='Fraudulent', hist=False, rug=False)
    ax.set(
        title='Distribution of Submitted Charge Amount - Fraud vs. Non-fraud - ' + p_type,
        xlabel='Log of USD Submitted Charge Amount',
        ylabel='Density'  # kernel density estimate, not a raw count
    )
    plt.show()
As the distribution comparisons show, there is no obvious evidence that the average fraudulent submitted charge is more expensive than the non-fraudulent charge across the top fraudulent categories.
This does not support my earlier hypothesis - perhaps because doctors are good at hiding wrongful claims rather than making them obvious by listing unreasonably high charges.
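One way to go beyond eyeballing the curves (not done in the original analysis) is a two-sample Kolmogorov-Smirnov test on the logged charge amounts. The sketch below uses synthetic lognormal charges, not the real CMS data, purely to show the mechanics.

```python
# Hedged sketch: KS test on synthetic lognormal "charge" samples.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
non_fraud_chrg = rng.lognormal(mean=4.0, sigma=1.0, size=5000)
fraud_chrg = rng.lognormal(mean=4.0, sigma=1.0, size=500)  # drawn from the same distribution here

stat, p_value = ks_2samp(np.log(non_fraud_chrg), np.log(fraud_chrg))
# A large p-value is consistent with the two samples sharing a distribution,
# matching the "no obvious difference" reading of the plots above.
print(stat, p_value)
```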
for p_type in top_7_types:
    # Eliminate 0 payments
    x = partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.average_medicare_payment_amt != 0)]['average_medicare_payment_amt']
    fraud_x = partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.fraud_average_medicare_payment_amt != 0)]['fraud_average_medicare_payment_amt']
    fig, ax = plt.subplots(figsize=(12, 5))
    sns.distplot(x, label='Non-fraudulent', hist=False, rug=False)
    sns.distplot(fraud_x, label='Fraudulent', hist=False, rug=False)
    ax.set(
        title='Distribution of Medicare Payment Amount - Fraud vs. Non-fraud - ' + p_type,
        xlabel='USD Payment Amount',  # amounts are not logged here
        ylabel='Density'
    )
    plt.show()
The same conclusion holds for the Medicare payment amount.
partB_2015_df = partB_df[partB_df['year'] == datetime(year=2015,month=1,day=1)]
fig, ax = plt.subplots(figsize=(15,7))
sns.heatmap(partB_2015_df.corr(method='pearson'), annot=True, fmt='.4f',
cmap=plt.get_cmap('coolwarm'), cbar=False, robust=True)
ax.set_yticklabels(ax.get_yticklabels(), rotation="horizontal")
ax.set(
title ='Correlation Heatmap in 2015 (only Continuous Variables)',
)
plt.show()
No insights are particularly interesting here, outside of the correlations we would normally expect.
# Get categorical variables (except location)
for col in ['nppes_provider_gender', 'provider_type']:
    partB_2015_df = pd.concat([partB_2015_df, pd.get_dummies(partB_2015_df[col], drop_first=True)], axis=1)
    partB_2015_df = partB_2015_df.drop(columns=col)
def get_redundant_pairs(df):
    '''Get duplicate pairs to drop in the correlation matrix after unstacking'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i + 1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_correlations(df):
    '''Get correlations, with self- and duplicate pairs removed'''
    au_corr = df.corr().unstack()  # unstack to a pairwise Series
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop)
    return au_corr
#get relevant columns
cols = partB_2015_df.columns[20:].tolist() + ['line_srvc_cnt','bene_unique_cnt','average_submitted_chrg_amt','fraudulent']
corr = get_correlations(partB_2015_df[cols])
corr = corr.to_frame().reset_index() #to frame
corr.columns = ['Variable A', 'Variable B', 'Correlation']
def color_df(value):
    '''Color green if positive, red if negative'''
    if value < 0:
        color = 'red'
    elif value > 0:
        color = 'green'
    else:
        color = 'black'
    return 'color: %s' % color
corr = corr.reindex(corr['Correlation'].abs().sort_values(ascending=False).index).reset_index(drop=True)
corr[:15].style.applymap(color_df, subset=['Correlation'])
Again, not many interesting conclusions can be made here, apart from the fact that some medical fields/types are dominated by one gender. The top pair, Line Service Count (number of services performed) and Unique Beneficiary Count (number of distinct beneficiaries), makes sense as having a positive correlation over 0.5.
Most notable is that Ambulatory Surgical Center and Anesthesiology are quite "expensive" and strongly positively correlated with Submitted Charge Amount -- an important insight, as they are among the high flyers of physician fraud.
fig = go.Figure()
col_layout_dict = {'non_fraud_services_count': ['Non-fraud Service','rgba(50, 171, 96, 0.6)'],
'fraud_services_count': ['Fraud Service','rgb(255, 0, 0)']} #dict for layout
for col in ['non_fraud_services_count', 'fraud_services_count']:
    fig.add_trace(go.Bar(
        y=type_df['provider_type'],
        x=type_df[col],
        name=col_layout_dict[col][0],
        marker=dict(
            color=col_layout_dict[col][1],
        ),
        orientation='h',
    ))
fig.update_layout(
barmode = 'stack',
title = 'Top 15 Fraudulent Medicare Category by Service Count',
paper_bgcolor='white',
plot_bgcolor='white',
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=True,
domain=[0, 0.95],
),
xaxis=dict(
zeroline=False,
showline=False,
showticklabels=True,
showgrid=True,
domain=[0, 0.95],
),
xaxis_title_text='Count',
)
annotations = [] #annotate with %
x = type_df['fraud_services_count']+type_df['non_fraud_services_count']+100000000
y_p = np.round(type_df['fraud_services_pct'].tolist(), decimals=2)
for y_p, x, y in zip(y_p, x, type_df['provider_type']):
    annotations.append(dict(xref='x1', yref='y1',
                            y=y, x=x,
                            text=str(y_p) + '%',
                            font=dict(family='Arial', size=12,
                                      color='rgb(255, 0, 0)'),
                            showarrow=False))
annotations.append(dict(xref='paper', yref='paper',
                        x=-0.2, y=-0.209,
                        text='Combined CMS and LEIE data ' +
                             'to label the leading fraudulent physician categories (15 Feb 2020)',
                        font=dict(family='Arial', size=10, color='rgb(150,150,150)'),
                        showarrow=False))
fig.update_layout(annotations=annotations)
fig.show()
By our definition, Anesthesiology, General Practice, Pain Management, and Geriatric Medicine have a high percentage of fraudulent services performed by excluded physicians.
Notably, 65% of Pain Management services were wrongfully performed (an inflated statistic, since we labeled ALL prior services by an excluded physician as fraudulent - in reality it is unlikely that all of them were). Regardless, it makes sense that the high flyers would be General Practice, Pain Management, and Geriatric Medicine (the elderly are the most vulnerable).
fig = go.Figure()
#dict for layout
col_layout_dict = {'female_count': ['Female Physicians ','#ffcdd2'],
'male_count': ['Male Physicians','#A2D5F2']}
for col in ['female_count', 'male_count']:
    fig.add_trace(go.Bar(
        y=type_df['provider_type'],
        x=type_df[col],
        name=col_layout_dict[col][0],
        marker=dict(
            color=col_layout_dict[col][1],
        ),
        orientation='h',
    ))
fig.update_layout(
barmode = 'stack',
title = 'Top 15 Fraudulent Medicare Category by Service Count and Gender',
paper_bgcolor='white',
plot_bgcolor='white',
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=True,
domain=[0, 0.90],
),
xaxis=dict(
zeroline=False,
showline=False,
showticklabels=True,
showgrid=True,
domain=[0, 0.90],
),
xaxis_title_text='Count',
)
fig.show()
The top fraudulent fields/categories are dominated by male physicians.
Clinical Laboratories and Ambulance Service Providers perform many line services but require fewer doctors, so they may simply lack the physician headcount to register on this bar plot.
partB_df.columns
#drop variables that we are not using
partB_train_df = partB_df.drop(columns=[
'fraud_average_medicare_payment_amt', #for visual only
'fraud_average_submitted_chrg_amt', #for visual only
'fraud_line_srvc_cnt', #for visual only
'year',
'excl_year',
'hcpcs_drug_indicator',
'place_of_service',
'nppes_provider_city',
'nppes_provider_country',
'hcpcs_description',
'nppes_provider_zip'
])
a = partB_train_df.fraudulent.value_counts()
display(a)
print('Fraudulent physicians are: {0}% of all data'.format(str(np.round((a[1]/(a[0]+a[1]))*100, decimals=6))))  # share of all rows, not the fraud-to-non-fraud ratio
display(partB_train_df.head())
print("We have 9 features, including 4 categorical variables in \n {0}".format(partB_train_df.columns[1:5].tolist()))
def undersample(df, category, split=0.5):
    '''Random undersampling based on a category and class split (% of dataset in the majority class)'''
    fraud_number = df[category].sum()
    non_fraud_indices = df[df[category] == 0].index  # indices of non-fraud rows
    random_indices = np.random.choice(non_fraud_indices,
                                      size=int(round(fraud_number * (split / (1 - split)), 0)),
                                      replace=False)  # sample at random without replacement
    fraud_indices = df[df[category] == 1].index  # indices of fraud rows
    under_sample_indices = np.concatenate([fraud_indices, random_indices])
    return df.loc[under_sample_indices]
data = undersample(partB_train_df, 'fraudulent')
#print
a = data.fraudulent.value_counts()
display(a)
print('Fraudulent physicians are {0}% of all data.'.format(str(np.round((a[1]/(a[0]+a[1]))*100,decimals = 2))))
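The undersampling logic can be checked standalone on a tiny synthetic frame (the function is restated here so the cell is self-contained): 5 positives against 95 negatives should come back as a balanced 50:50 sample.

```python
import numpy as np
import pandas as pd

def undersample(df, category, split=0.5):
    # Same undersampling logic as above, restated for a self-contained check
    fraud_number = df[category].sum()
    non_fraud_indices = df[df[category] == 0].index
    random_indices = np.random.choice(non_fraud_indices,
                                      size=int(round(fraud_number * (split / (1 - split)), 0)),
                                      replace=False)
    fraud_indices = df[df[category] == 1].index
    return df.loc[np.concatenate([fraud_indices, random_indices])]

np.random.seed(42)
toy = pd.DataFrame({'fraudulent': [1] * 5 + [0] * 95})
balanced = undersample(toy, 'fraudulent')
print(sorted(balanced['fraudulent'].value_counts().to_dict().items()))  # [(0, 5), (1, 5)]
```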
# # We are NOT normalizing the data
# from sklearn import preprocessing
# def scale_predictors(df):
# '''Takes in df and returns normalized data [0,1] '''
# x = df.values #numpy array
# min_max_scaler = preprocessing.MinMaxScaler()
# x_scaled = min_max_scaler.fit_transform(x)
# return pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
# One-hot encoding
def one_hot_encode(df, cols):
    '''Encodes the categorical variables in df, with cols as a list of column names'''
    for col in cols:
        df = pd.concat([df, pd.get_dummies(df[col], drop_first=True)], axis=1)
    return df
cols = ['provider_type', 'nppes_provider_state', 'hcpcs_code', 'nppes_provider_gender']
data = one_hot_encode(data, cols)
data = data.drop(columns=cols, axis=1).reset_index(drop=True) #drop column that's been encoded
display(data.shape)
print('There are {0} predictors after turning categories into dummy variables.'.format(data.shape[1]-2))
data.head()
data.fraudulent.value_counts()
import statsmodels.api as sm
cols = [col for col in data.columns if col not in ['npi', 'fraudulent']]
X = data[cols]
y = data['fraudulent']
logit_model = sm.Logit(y, X)
result = logit_model.fit(method='bfgs')
print(result.summary2())
1) Split into train vs. test sets to understand how well a model performs on its evaluation metrics (precision, recall, AUC). Variable transformation, for example, is a step toward a better model.
2) Here, we only tried a 50:50 class distribution for fraud vs. non-fraud. It would be worth experimenting with different class splits and comparing results.
3) Try other classification models that previous literature has found productive (XGBoost, SVM).
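Step 1 above can be sketched as follows; synthetic features stand in for the real predictors here, and scikit-learn's `LogisticRegression` replaces the statsmodels fit, so treat this as a template rather than the final evaluation.

```python
# Hedged sketch of next step (1): hold out a test set, then report
# precision, recall, and AUC for a logistic classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # synthetic predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]
print('precision:', round(precision_score(y_test, pred), 2))
print('recall:   ', round(recall_score(y_test, pred), 2))
print('AUC:      ', round(roc_auc_score(y_test, proba), 2))
```

With the real combined data, `X` and `y` would come from `data[cols]` and `data['fraudulent']` as in the statsmodels cell above.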
Coding:
Papers: